Future Web Archive

Natanael Arndt

LSWT 2024

2024-04-18

Overview

The Archiving Life-Cycle, inspired by the Linked Data Lifecycle

Background

The Law regarding the German National Library (passed on 22 June 2006) gave the German National Library (DNB) the task of collecting, cataloguing, indexing and archiving non-physical media works (online publications).

Examples of online publications:

  • electronic magazines, eBooks
  • university dissertations, digitised content
  • music files, audio books
  • websites and databases

Archiving Workflow

Selective Web Crawling

  • Domain Crawl: all websites reachable under the .de TLD
  • Event Crawls: sites related to, e.g., elections, sports events, pandemics
  • Special Crawls: on request, to preserve individual sites
  • Curation of a List of Seed URLs
  • Highly dynamic Content
  • Social Media
  • Video-Streaming-Platforms
  • Databases
  • Digital Editions
  • Content behind a Paywall, Advertising, Cookie Consent
  • Metadata-Harvesting
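
As a minimal sketch of how a curated seed list might be scoped to the .de domain before a crawl, the following Python snippet filters candidate URLs by host. The function names `in_de_domain` and `select_seeds` are hypothetical and chosen for illustration only; this is not the DNB's actual crawling pipeline.

```python
from urllib.parse import urlsplit


def in_de_domain(url: str) -> bool:
    """Return True if the URL's host lies under the .de top-level domain."""
    # urlsplit lower-cases the hostname; it is None if the URL has no host.
    host = urlsplit(url).hostname
    return host is not None and host.rstrip(".").endswith(".de")


def select_seeds(candidates):
    """Keep http(s) URLs in the .de domain, dropping duplicates in order.

    Hypothetical helper: a real seed list would also be curated by hand.
    """
    seen = set()
    seeds = []
    for url in candidates:
        if urlsplit(url).scheme not in ("http", "https"):
            continue  # skip mailto:, ftp:, and other non-web schemes
        if in_de_domain(url) and url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds
```

The check is made on the parsed host rather than on the raw string, so a URL such as `https://example.com/page.de` is not mistaken for a .de site.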

  • Replay the archived Webpage to the users
  • Allow Data Driven Access to the Data
  • Ensure copyright compliance

Access is exclusive to the reading rooms of the DNB, via the catalogue or the web archive portal, unless the rights holder grants the right to make the archived websites freely accessible worldwide.

Partnerships

  • International level:
    • IIPC membership
    • Internet Archive (special access to snapshots of websites in the .de domain)
  • National level:
    • division of labor with regional legal deposit libraries
    • public dialogue, e.g. workshops with other cultural institutions and relevant experts

Research Questions

The Web archive should provide a service to readers, e.g. interested people and researchers in the domains of digital humanities, linguistics, political science, prosopographic research, arts, data science, computational social science, …

What would be possible research questions that you would like to answer based on a Web archive?

What metadata and what data would you deem relevant to be collected in the catalogue?

Contact

Dr. Natanael Arndt

Deutsche Nationalbibliothek

Automatische Erschließungsverfahren, Netzpublikationen